DescribeX: A Framework for Exploring and Querying XML Web Collections
Author: Flavio Rizzolo
Abstract
Flavio Rizzolo. Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto, 2008.
The nature of semistructured data in web collections is evolving. Even when XML web documents are valid with regard to a schema, the actual structure of such documents exhibits significant variations across collections for several reasons: an XML schema may be very lax (e.g., to accommodate the flexibility needed to represent collections of documents in RSS feeds), a schema may be large, with different subsets used for different documents (e.g., this is common in industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). A schema alone may not provide sufficient information for many data management tasks that require knowledge of the actual structure of the collection.
Web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly large web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). Dealing with the (highly variable) structure of such web collections poses additional challenges.
1 http://www.rss-specifications.com/
2 http://oasis-open.org/committees/ubl/
Similar resources
Fast Answering of XPath Query Workloads on Web Collections
Several web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly larger web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath ...
Exploring PSI-MI XML Collections Using DescribeX
PSI-MI has been endorsed by the protein informatics community as a standard XML data exchange format for protein-protein interaction datasets. While many public databases support the standard, there is a degree of heterogeneity in the way the proposed XML schema is interpreted and instantiated by different data providers. Analysis of schema instantiation in large collections of XML data is a ch...
Querying Xml Document Collections
In this paper we describe a query interface towards XML document collections. External schema annotation in RDF contains information used to dynamically build the interface tailored to the user’s characteristics and to the document structure, as described by its XML Schema. The interface makes the user aware of structure semantics, so supporting her/him in formulating semantically correct queri...
Ontology-Based XQuery'ing of XML-Encoded Language Resources on Multiple Annotation Layers
We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology provides a homogenising view on the conceptually different markup languages so that a common querying framework can be established using the method of ontology-based query expansion. In addition, we present a highly...
Querying Structured XML Document Collections
The number of XML document collections is increasing, and it’s important to effectively query them. Document semantics is in both the text and the structure. In this paper we describe a query interface towards XML document collections. The interface is automatically tailored to the document structure, as described by its XML Schema. External schema annotation in RDF contains information used to...
Journal: CoRR
Volume: abs/0807.2972
Pages: -
Publication date: 2008